model: Granite4 Vision #23545

Merged: ngxson merged 103 commits into ggml-org:master from gabe-l-hart:Granite4Vision on Jun 5, 2026

@gabe-l-hart
Collaborator

@gabe-l-hart gabe-l-hart commented May 22, 2026

Overview

This PR adds support for the Granite4VisionForConditionalGeneration mtmd architecture. It specifically targets the following models:

Additional information

The Granite4Vision models leverage several key architectural patterns that have not been previously supported:

  1. Deepstack + Spatial projectors injected at non-contiguous points in the LLM layer stack
  2. Llava-next encoder / assembler with learned newline token
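
As a rough illustration of the second pattern, the sketch below interleaves tile tokens into full-width rows and appends a newline marker after each row of the assembled grid. It is a minimal sketch under stated assumptions: tokens are plain `int`s standing in for embeddings, the function name and tile layout are hypothetical, and unpad cropping is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tokens are ints here purely for illustration; real embeddings are float vectors.
// tiles[ty * grid_x + tx] holds one tile's tokens in row-major order (assumed layout).
static std::vector<int> assemble_with_newlines(
        const std::vector<std::vector<int>> & tiles,
        size_t grid_x, size_t grid_y,
        size_t rows_per_tile, size_t cols_per_tile,
        int newline_token) {
    assert(tiles.size() == grid_x * grid_y);
    std::vector<int> out;
    out.reserve(grid_y * rows_per_tile * (grid_x * cols_per_tile + 1));
    for (size_t ty = 0; ty < grid_y; ++ty) {
        for (size_t r = 0; r < rows_per_tile; ++r) {
            // One full-width row spans local row r of every tile in tile-row ty.
            for (size_t tx = 0; tx < grid_x; ++tx) {
                const std::vector<int> & tile = tiles[ty * grid_x + tx];
                assert(tile.size() == rows_per_tile * cols_per_tile);
                for (size_t c = 0; c < cols_per_tile; ++c) {
                    out.push_back(tile[r * cols_per_tile + c]);
                }
            }
            out.push_back(newline_token);  // learned newline closes the row
        }
    }
    return out;
}
```

Because a single output row crosses tile boundaries, the newline cannot be appended tile by tile; the output holds one extra token per full-width row on top of the tile tokens.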

Because of these two architectural patterns, this PR makes several key architectural shifts in the project:

Arch Changes in libllama

  • llama_hparams.n_deepstack_layers -> llama_hparams.deepstack_layers_arr
    • This allows G4V to inject projector outputs at specific LLM layers
    • Backwards-compatibility is maintained for existing models using n_deepstack_layers, specifically the Qwen3VL family, by loading deepstack_layers_arr as either a single-valued number or a multi-valued array
    • Open Question: This type of try/catch backwards compatibility is not something I've seen elsewhere, so I want to see whether it is a strong enough anti-pattern that I should instead use a net-new hparam with overlapping meaning.
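
The single-valued versus multi-valued loading described above can be sketched roughly as follows. This is not the loader code from the PR: `kv_value`, `get_arr`, and `load_deepstack_layers` are hypothetical stand-ins, and the per-layer layout (one entry per LLM layer, negative meaning "no injection") is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <variant>
#include <vector>

// Hypothetical stand-in for a GGUF metadata value: one integer or an array.
using kv_value = std::variant<uint32_t, std::vector<int32_t>>;

// Hypothetical strict array getter: throws when the stored value is not an array.
static std::vector<int32_t> get_arr(const kv_value & v) {
    if (!std::holds_alternative<std::vector<int32_t>>(v)) {
        throw std::runtime_error("value is not an array");
    }
    return std::get<std::vector<int32_t>>(v);
}

// Build a per-layer table: entry il holds the deepstack input index injected at
// LLM layer il, or -1 when nothing is injected there (assumed layout).
static std::vector<int32_t> load_deepstack_layers(const kv_value & v, uint32_t n_layer) {
    try {
        // Multi-valued: the array is used as stored, padded with -1 if short.
        std::vector<int32_t> arr = get_arr(v);
        arr.resize(n_layer, -1);
        return arr;
    } catch (const std::runtime_error &) {
        // Single-valued: a count N means "inject at the first N layers".
        const uint32_t n = std::get<uint32_t>(v);
        std::vector<int32_t> arr(n_layer, -1);
        for (uint32_t il = 0; il < n && il < n_layer; ++il) {
            arr[il] = static_cast<int32_t>(il);
        }
        return arr;
    }
}
```

With this shape, a legacy value of 3 on a 5-layer model expands to `{0, 1, 2, -1, -1}`, while an array such as `{-1, 0, -1, -1, 1}` is taken as-is, which is what allows injection points to be spaced out.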

Arch Changes in mtmd

  • Introduced a new class hierarchy for clip_assembler in clip-graph.h that parallels the clip_graph factory pattern
    • This class hierarchy will support models that have graph operations that need to happen after the individual image tiles have been encoded (eg llava-next style with learned newlines)
  • Introduced clip_image_f32.append_token field that can be used by individual graphs to determine how to handle injecting learned newlines for each image tile.
  • New public methods in clip.h to support the model-agnostic assembler logic
    • clip_image_assemble: This is the factory function for using the clip_assembler hierarchy to perform assembly
    • clip_n_assembled_output_tokens: This allows model-specific logic for counting output tokens based on how the assembly will work
  • New hparam section for values that will be explicitly shared between an LLM and its MMPROJ
    • This is needed to bind the embedding_scale value to both the LLM and the MMPROJ so that the base stream can be pre-scaled to invert the embedding scaling that happens in the LLM
  • Update: the shared hparam section above is no longer needed, given the skip logic for embedding scale with input embeddings
  • clip_hparams.vision_feature_layer changed from an unordered_set to a vector to support a strict ordering and duplicate values. The ordering will map to the order of the projectors, multiple of which will pull from the same vision layer.
  • Hoist QFormer tensors in clip_model into a qf_block struct and hold a vector of them in clip_model
    • This maintains backwards compatibility for Granite Speech, which uses a vector of size 1, while allowing G4V to support multiple blocks
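
To illustrate why `vision_feature_layer` needs an ordered container that allows duplicates, here is a small sketch. The function names and the layer values in the usage note are hypothetical; the only point is that a vector keeps one entry per projector while a set collapses repeated layers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// For each projector (in order), return the position of its source layer in the
// list of encoder layers that must be captured. Duplicates share one capture.
static std::vector<size_t> map_projectors_to_captures(
        const std::vector<int32_t> & vision_feature_layer,
        std::vector<int32_t>       & captured_layers) {
    std::vector<size_t> proj_to_capture;
    captured_layers.clear();
    for (int32_t layer : vision_feature_layer) {
        size_t pos = 0;
        while (pos < captured_layers.size() && captured_layers[pos] != layer) {
            ++pos;
        }
        if (pos == captured_layers.size()) {
            captured_layers.push_back(layer);  // first time this layer is requested
        }
        proj_to_capture.push_back(pos);
    }
    return proj_to_capture;
}

// What an unordered_set would retain: only the number of distinct layers.
static size_t n_unique_layers(const std::vector<int32_t> & vision_feature_layer) {
    return std::unordered_set<int32_t>(vision_feature_layer.begin(),
                                       vision_feature_layer.end()).size();
}
```

For a list such as `{3, 7, 7, 15}`, the vector form still describes four projectors (two reading layer 7), whereas the set form would only report three distinct layers and lose the projector-to-layer correspondence.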

Open Questions

Before merging, I want to address the following open questions:

  • Maintainer alignment on introduction of clip_assembler paradigm (removed in favor of clip_image_f32.append_token)
  • Maintainer alignment on hparam try/catch single vs multi value parsing paradigm
  • Mathematical alignment with alternate implementations (transformers and pure Claude implementation, see AI usage disclosure)
  • Is there a cleaner way to skip the f_embedding_scale in llama-graph.cpp if (and only if) the input embeddings have valid image embeddings that doesn't require multimodal knowledge to leak into the core library?
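
For context on the last question, the two ways of keeping image embeddings unscaled can be compared with plain arithmetic. This is only a sketch on `std::vector<float>`, not ggml graph code; the function names and the `skip_scale` flag are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Option A: pre-scale image embeddings by 1/scale so the LLM's multiply cancels out.
static std::vector<float> prescale_inverse(std::vector<float> embd, float scale) {
    for (float & x : embd) {
        x /= scale;
    }
    return embd;
}

// Stand-in for the LLM-side scaling of its inputs. With skip_scale set (option B),
// the input embeddings pass through untouched.
static std::vector<float> apply_embedding_scale(std::vector<float> embd, float scale, bool skip_scale) {
    if (!skip_scale) {
        for (float & x : embd) {
            x *= scale;
        }
    }
    return embd;
}
```

Both routes end with the same values up to floating-point rounding. Option A needs the scale to be known on the projector side, while option B needs the LLM side to know that the inputs are embeddings rather than token lookups.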

Requirements

AI Usage Disclosure

AI was used a lot in the creation of this PR! That said, the bulk of the work was actually meshing the AI's efforts into the existing architecture in a way that caused the least possible friction. I've annotated each commit with an AI-usage line (see stats below). There were two key ways that AI was used:

  1. Granite Vision teammate @EliSchwartz built a working version of this branch entirely using Claude Code (here). This was heavily used as a reference implementation to check the implementation here that was more closely aligned with project patterns. Sections of this were referenced/copied verbatim (see commits with Co-authored-by).
  2. Various agent/model combinations were used to assist in design/refactor throughout the branch

NOTE: I also failed with AI a bunch of times. Most agent/model combos couldn't handle the complexity of the architectural merger between G4V's architecture quirks and the various components of mtmd.

git-ai-stats

╔══════════════════════════════════════════════════════════╗
║ GIT AI USAGE ANALYSIS ║
╚══════════════════════════════════════════════════════════╝

📊 COMMITS BY AGENT

--- Aggregate ---
Commits | Count

none | 53
OpenCode + qwen3.5:122b | 5
Claude Code + Opus 4.7 | 4
IBM Bob | 1
Claude Code, IBM Bob | 1
OpenCode + Qwen 3.6-35B | 1
Claude Code | 1

TOTAL | 66

📊 COMMITS BY USAGE TYPE

--- Aggregate ---
Commits | Count

none | 53
draft | 6
full | 7

TOTAL | 66

📈 LINES OF CODE BY AGENT

--- Aggregate ---
Agent | Commits | Additions | Deletions

none | 53 | 1355 | 886
OpenCode + qwen3.5:122b | 5 | 50 | 3
Claude Code + Opus 4.7 | 4 | 463 | 296
IBM Bob | 1 | 13 | 0
Claude Code, IBM Bob | 1 | 600 | 7
OpenCode + Qwen 3.6-35B | 1 | 4 | 0
Claude Code | 1 | 30 | 0

TOTAL | 66 | 2515 | 1192

📈 LINES OF CODE BY USAGE TYPE

--- Aggregate ---
Usage Type | Commits | Additions | Deletions

none | 53 | 1355 | 886
draft | 6 | 802 | 27
full | 7 | 358 | 279

TOTAL | 66 | 2515 | 1192

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ybrid

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
There are several awkward things here:

1. Most of these are essentially identical to the audio qformer tensors. On
the c++ side, that's mapped using the prefix, so the rest of the GGUF
name needs to align, but on the python side there's no prefix notion, so
they all get duplicated.
2. There are a couple of net-new tensors for vision, in particular
PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as
belonging to the qformer portion, but the GGUF name is simply proj_norm
which conflicts with the ideal name for this new PROJ_NORM that is not
qualified as part of the qformer. To get around this, I used
"proj_layernorm" as the GGUF name.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
NOTE: Usage of these hasn't been updated to include prefix yet

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
We need to preserve the ordering of these feature index values so that they
can be mapped to the sub-tensors within the stacked projectors.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This handles stacking the projector tensors and setting the new hparams

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ack layer arr

Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This defaults to False, but allows a user to enable it programmatically
instead of using the interactive prompt.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…e block

This is cleaner than stacking them. The modeling file hard-codes
single-layer qformers, so we can punt on the multiple multi-layer
projectors problem.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
New hparams:
- KEY_PROJ_SAMPLE_QUERY_SIDE
- KEY_PROJ_SAMPLE_WINDOW_SIDE
- KEY_PROJ_SPATIAL_OFFSETS

New tensors:
- TN_MULTI_PROJ_IMG_POS
- TN_MULTI_PROJ_QUERY
- TN_MULTI_PROJ_LAYERNORM
- TN_MULTI_PROJ_LINEAR
- TN_MULTI_PROJ_NORM

Branch: Granite4Vision
AI-usage: none

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
This appears to have been added during Qwen3 VL
(ggml-org#16780), but it was never
actually used.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The old logic hard coded a correspondence between the first N layers of the
LLM and the 1->N entries in the input embeddings. Now, that relationship is
maintained at loading time if the GGUF value is single-valued. If it is
multi-valued, it loads directly allowing for deepstack layers to be spaced
out throughout the model.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The alternative would be to use get_key_or_arr, but then the single value
would be populated through the entire array and we'd need to detect that
and update it with the right correspondence.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
The use of ggml_add here assumes that the elements of inp_embd will be pre-
arranged to be the full embedding length with only the vision-mask'ed
portions non-zero from the projector. This matches how Qwen3VL does it.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
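
The assumption behind the element-wise add can be sketched with plain vectors. This is not ggml code: `scatter_to_full_length` and `add_in_place` are hypothetical helpers showing that text positions stay untouched only because their rows in the deepstack input are zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expand projector output (image tokens only) to one row per token,
// leaving zero rows at text positions.
static std::vector<float> scatter_to_full_length(
        const std::vector<float> & proj_out,
        const std::vector<bool>  & is_image,
        size_t n_embd) {
    std::vector<float> full(is_image.size() * n_embd, 0.0f);
    size_t src = 0;
    for (size_t t = 0; t < is_image.size(); ++t) {
        if (!is_image[t]) {
            continue;
        }
        assert(src + n_embd <= proj_out.size());
        for (size_t i = 0; i < n_embd; ++i) {
            full[t * n_embd + i] = proj_out[src + i];
        }
        src += n_embd;
    }
    return full;
}

// Plain element-wise add, standing in for the graph-level add at the injection layer.
static void add_in_place(std::vector<float> & hidden, const std::vector<float> & deepstack) {
    assert(hidden.size() == deepstack.size());
    for (size_t i = 0; i < hidden.size(); ++i) {
        hidden[i] += deepstack[i];
    }
}
```

If the deepstack rows were not pre-arranged this way, the add would either fail on a shape mismatch or shift text-token hidden states as well.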
Branch: Granite4Vision
AI-usage: full (OpenCode + Qwen 3.6-35B)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…ulti-proj

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* origin/master:
vocab : support tokenizer for LFM2.5-8B-A1B (ggml-org#23826)
graph : ensure DS32 kq_mask_lid is F32 (ggml-org#23864)
server: remove obsolete scripts (ggml-org#23870)
ci : update macos release to use macos-26 runner (ggml-org#23878)
download: add option to skip_download (ggml-org#23059)
mtmd: Add DeepSeekOCR 2 Support (ggml-org#20975)
CUDA: Check PTX version on host side to guard PDL dispatch (ggml-org#23530)
server: bump timeout to 3600s (ggml-org#23842)
model : support for DeepseekV32ForCausalLM with generic DeepSeek Sparse Attention (DSA) implementation (ggml-org#23346)
llama: use f16 mask for FA to save VRAM (ggml-org#23764)
sync : ggml
ggml : bump version to 0.13.1 (ggml/1523)
ngram-mod : Add missing include (ggml-org#23857)
llama: add llm_graph_input_mtp (ggml-org#23643)
app : move licences to llama-app (ggml-org#23824)
cuda : disables launch_fattn PDL enrollment due to compiler bug (ggml-org#23825)
meta : Add missing `buffer` set in allreduce fallback !COMPUTE clear (ggml-org#23480)
This was inherited from the Claude Code implementation that pushed the
negative index inversion down into the model file.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
face. palm. :(

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* origin/master:
server: in SSE mode, send HTTP headers when slot starts (ggml-org#23884)
ggml-webgpu: Check earlier for WebGPU required features (ggml-org#23879)
ggml-webgpu: add q4_0/q8_0 SET_ROWS (ggml-org#23760)
server-bench : add speed-bench for speculative decoding benchmarking (ggml-org#23869)
app: add llama update self updater (ggml-org#23865)
ui: handle audio/vnd.wave as audio WAV file (ggml-org#23754)
@gabe-l-hart
Collaborator Author

Ok, I think this is fully ready @ngxson. I found my two bugs that were causing the mathematical delta with Eli's version, so I've got exact matching output now!

* origin/master: (36 commits)
vendor : update cpp-httplib to 0.46.1 (ggml-org#23980)
llama: limit max outputs of `llama_context` (ggml-org#23861)
metal: template GLU kernels to support f16/f32 (ggml-org#23882)
vulkan: don't hold the device mutex while compiling pipelines (ggml-org#23641)
vulkan: reduce host memory lock contention (ggml-org#23376)
vocab: add normalizer.lowercase support to WPM (ggml-org#23899)
TP: quantized KV cache support (ggml-org#23792)
security : disable private disclosures (ggml-org#23963)
model: Add EXAONE 4.5 implementations (ggml-org#21733)
vulkan: Block-load Q3_K/Q6_K block data and subtract on 32b ints (ggml-org#23056)
vulkan: Removed unused functions (ggml-org#23175)
common : support manually triggering the reasoning budget end sequence (ggml-org#23949)
ci : add missing Linux label to cpu-x64-high-perf runner (ggml-org#23958)
[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (ggml-org#23812)
[SYCL] Add more types in GET_ROWS OP (ggml-org#23710)
sycl : Optimize Q3_K mul_mat by reorder (ggml-org#23725)
ci: remove redundant or duplicate jobs (ggml-org#23927)
server : handle If-None-Match weak ETags (ggml-org#23916)
ci : limit trigger paths for the CPU workflow (ggml-org#23938)
vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-org#18756)
...
@gabe-l-hart
Collaborator Author

I think these test failures look unrelated since they're in test-backend-ops and this PR doesn't touch any kernels.

* origin/master: (57 commits)
server : disable on-device spec checkpoints (ggml-org#24108)
arg: fix double mtp downloads (ggml-org#24128)
webui: [a11y] fix keyboard navigation issues in chat interface and sidebar (ggml-org#23132)
Move duplicated imatrix code into single common imatrix-loader.cpp (ggml-org#22445)
ui: Fixed packages (ggml-org#24119)
ui: added single line reasoning preview (ggml-org#23601)
return filter to save memory (ggml-org#24125)
convert: Fix Gemma 4 Unified conversion (ggml-org#24118)
ggml: vectorize ggml_vec_dot_q4_1_q8_1 with WASM SIMD128 (ggml-org#22209)
server: avoid unnecessary checkpoint restore when new tokens are present (ggml-org#24110)
agents: refactor, include more guidelines (ggml-org#24111)
webui: fix tool selector toggle/counter, key tools by stable identity (ggml-org#24065)
build : use umbrella Headers directory for XCFramework module map (ggml-org#23974)
server : add header to tools/server/server-http.h (ggml-org#24089)
cmake: skip cvector-generator and export-lora when CPU backend is disabled (ggml-org#24053)
fix(mtmd): handle Gemma 4 audio projector embedding size (ggml-org#24091)
readme : add status badges (ggml-org#24104)
tests : refactor test-save-load-state to accept token input (ggml-org#24073)
metal : reduce rset heartbeat from 500ms -> 5ms (ggml-org#24074)
ggml-webgpu: FlashAttention refactor + standardize quantization support (ggml-org#23834)
...
@gabe-l-hart
Collaborator Author

@ngxson Gentle nudge. This PR should be ready for final review now.

Comment thread conversion/granite.py Outdated
Comment thread convert_lora_to_gguf.py Outdated
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Comment thread tools/mtmd/clip-impl.h Outdated
Comment thread tools/mtmd/clip-model.h Outdated
Comment thread tools/mtmd/clip.h Outdated
Comment thread src/llama-graph.cpp
Comment thread src/models/granite.cpp
std::unordered_set<uint32_t> unique_deepstack_idxs;
for (const auto val : hparams.deepstack_mapping_arr) {
    if (val >= 0) {
        unique_deepstack_idxs.insert(val);
Contributor

It may be worth checking the upper bound for val too

Collaborator Author

Actually, this is just counting the number of unique values, so I'm not sure this is the right place to guard against malicious values. That should probably be right above while loading (maybe just an assertion that the values are within the right range)
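
A load-time range check along the lines suggested here might look like the sketch below. It is an assumption-laden illustration rather than the code that was merged: `validate_deepstack_mapping` and `n_inputs` are hypothetical names, and negative values are treated as "no injection" to match the `val >= 0` test in the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <vector>

// Reject mapping values that point past the available deepstack inputs.
// Negative values are treated as "no injection at this layer".
static void validate_deepstack_mapping(const std::vector<int32_t> & mapping, int32_t n_inputs) {
    for (size_t il = 0; il < mapping.size(); ++il) {
        const int32_t val = mapping[il];
        if (val >= n_inputs) {
            std::stringstream ss;
            ss << "deepstack mapping value " << val << " at layer " << il
               << " must be less than " << n_inputs;
            throw std::runtime_error(ss.str());
        }
    }
}
```

Doing the check once while loading keeps the counting loop free of validation and turns a malformed file into an error before any graph is built.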

Comment thread tools/mtmd/clip.cpp Outdated
Comment thread tools/mtmd/clip-model.h Outdated
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
NOTE: format_string is not available in granite.cpp (and including
clip-impl.h to get it doesn't compile, so I think it violates the intended
encapsulation), so std::stringstream is the simplest answer.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
@ngxson ngxson merged commit 64086f2 into ggml-org:master Jun 5, 2026
25 of 37 checks passed
@gabe-l-hart gabe-l-hart deleted the Granite4Vision branch June 5, 2026 15:46
@gabe-l-hart
Collaborator Author

Thanks for all the review help @ngxson !

@ngxson
Contributor

ngxson commented Jun 6, 2026

@gabe-l-hart it seems like the test case ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M is broken in tools/mtmd/test.sh, can you investigate it? I'm not sure if it's related to this PR

The test command: ./build/bin/llama-mtmd-cli -hf ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M --image ./tools/mtmd/test-1.jpeg --temp 0 -n 128 --flash-attn on -p 'what is the publisher name of the newspaper?'

@gabe-l-hart
Collaborator Author

Thanks for flagging it. I'll investigate!

ggerganov pushed a commit to am17an/llama.cpp that referenced this pull request Jun 6, 2026
* feat(convert): Get language model conversion working for 4.1 vision

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0)

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Disable vocab padding for non-hybrid models that use GraniteMoeHybrid

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Plumb python-side vision projector names and mappings

There are several awkward things here:

1. Most of these are essentially identical to the audio qformer tensors. On
the c++ side, that's mapped using the prefix, so the rest of the GGUF
name needs to align, but on the python side there's no prefix notion, so
they all get duplicated.
2. There are a couple of net-new tensors for vision, in particular
PROJ_NORM. In both speech and vision, the QF_PROJ_NORM is qualified as
belonging to the qformer portion, but the GGUF name is simply proj_norm
which conflicts with the ideal name for this new PROJ_NORM that is not
qualified as part of the qformer. To get around this, I used
"proj_layernorm" as the GGUF name.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add python side architecture name

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add python-side plumbing for setting FEATURE_LAYERS hparam

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add c++ side tensor naming defines

NOTE: Usage of these hasn't been updated to include prefix yet

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(mtmd): Convert vision_feature_layer to an ordered vector

We need to preserve the ordering of these feature index values so that they
can be mapped to the sub-tensors within the stacked projectors.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(mtmd): Add architecture label plumbing

Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat(wip): Add partial conversion for mmproj

This handles stacking the projector tensors and setting the new hparams

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add gguf_writer and constant support for new hparams and deepstack layer arr

Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Full conversion for mmproj w/ tensor mappings

Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add lm_head skip for mmproj for 4.0

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: De-alias text_config architecture in convert_lora_to_gguf.py

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add --trust-remote-code arg to convert_lora_to_gguf.py

This defaults to False, but allows a user to enable it programmatically
instead of using the interactive prompt.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: De-alias model.language_model. -> model. for lora adapters

Branch: Granite4Vision
AI-usage: full (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Extend language model tensor dealiasing in adapters

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary registration for GraniteSpeech in language model

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Plumb through mm prefix formatting for qformer tensors

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Refactor vision projector tensors to use predictor ID as the block

This is cleaner than stacking them. The modeling file hard-codes
single-layer qformers, so we can punt on the multiple multi-layer
projectors problem.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add spatial offsets array hparam conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add stub plumbing for granite vision in mtmd

Branch: Granite4Vision
AI-usage: draft (OpenCode + qwen3.5:122b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add new hparam and tensor naming in clip-impl.h

New hparams:
- KEY_PROJ_SAMPLE_QUERY_SIDE
- KEY_PROJ_SAMPLE_WINDOW_SIDE
- KEY_PROJ_SPATIAL_OFFSETS

New tensors:
- TN_MULTI_PROJ_IMG_POS
- TN_MULTI_PROJ_QUERY
- TN_MULTI_PROJ_LAYERNORM
- TN_MULTI_PROJ_LINEAR
- TN_MULTI_PROJ_NORM

Branch: Granite4Vision
AI-usage: none

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Move deepstack_layer_arr to llm hparam instead of mmproj

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove IS_DEEPSTACK_LAYERS

This appears to have been added during Qwen3 VL
(ggml-org#16780), but it was never
actually used.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: n_deepstack_layers -> deepstack_layer_arr

The old logic hard coded a correspondence between the first N layers of the
LLM and the 1->N entries in the input embeddings. Now, that relationship is
maintained at loading time if the GGUF value is single-valued. If it is
multi-valued, it loads directly allowing for deepstack layers to be spaced
out throughout the model.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use try/catch for single/multi valued deepstack info

The alternative would be to use get_key_or_arr, but then the single value
would be populated through the entire array and we'd need to detect that
and update it with the right correspondence.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add deepstack injection point for granite LLM

The use of ggml_add here assumes that the elements of inp_embd will be pre-
arranged to be the full embedding length with only the vision-mask'ed
portions non-zero from the projector. This matches how Qwen3VL does it.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: add missing vision attn layernorm eps

Branch: Granite4Vision
AI-usage: full (OpenCode + Qwen 3.6-35B)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Hoist qformer tensors into qf_block and hold a vector for multi-proj

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix missing prefix template for TN_QF_PROJ_LINEAR

It's not strictly necessary since vision uses the blockwise version, but it
makes the loading consistent.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add embedding scale and image grid pinpoints hparams in conversion

Also remove dead parsing for self._deepstack_layer_arr

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add mtmd KEY_ section for hparams shared with the LLM

In this case, we need the EMBEDDING_SCALE so we can unscale the image
embeddings to compensate for applying embedding scale to the input
embeddings

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Implement c++ hparam parsing

Branch: Granite4Vision
AI-usage: draft (Claude Code)
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Flatten pinpoints in conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing break

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: No reason to have modality prefix for img_pos

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tensor loading

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the right portion of speech for tensor loading!

Also plumb through the layernorm -> post_norm naming change

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add logging of deepstack_layers_arr if set

I also changed the print_f output type to int32_t to avoid printing
overflow values for -1. This could cause overflows on the other side, but
I can't imagine a value for any of the current array hparams that would
trigger that.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
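
The printing issue mentioned in this commit can be reproduced in isolation. The sketch below uses `std::to_string` rather than the project's own print helper, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Interpreting the value as unsigned turns the -1 sentinel into a huge number.
static std::string print_as_u32(int32_t v) {
    return std::to_string(static_cast<uint32_t>(v));
}

// Printing as a signed 32-bit value keeps the sentinel readable.
static std::string print_as_i32(int32_t v) {
    return std::to_string(v);
}
```

The trade-off noted in the commit is the mirror image: an array hparam above 2147483647 would no longer print correctly as a signed value.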

* fix: Make sure input embeddings are cont before f_embedding_scale

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add init and mmproj_embd cases for g4v

The n_mmproj_embd is 1+ to make space for the text embedding and all 8
projectors

Branch: Granite4Vision
AI-usage: draft (Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Invert (h, w) -> (w, h) pinpoints

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Reorder projectors based on llm index and skip the first injection

The multi-projector stack has a strange asymmetry based on how it's
currently implemented for qwen3vl: on the mmproj side, it's all N
projectors, but the output of the "first" (by inp_embd index) projector is
automatically consumed as if it were a standard single-projector mmproj,
so the deepstack portion needs to only contain the 1-N entries.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>
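
One way to picture the asymmetry described in this commit is the sketch below: projectors are ordered by the LLM layer that consumes them, the first one is treated as the ordinary mmproj output, and only the remainder form the deepstack portion. The struct and function names are hypothetical, and the real tensor plumbing is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct projector_slot {
    int32_t proj_id;    // index of the projector in the mmproj file (hypothetical)
    int32_t llm_layer;  // LLM layer that consumes this projector's output
};

struct injection_plan {
    int32_t                     base_proj;  // consumed like a single-projector mmproj
    std::vector<projector_slot> deepstack;  // remaining projectors, ordered by LLM layer
};

static injection_plan plan_injections(std::vector<projector_slot> slots) {
    assert(!slots.empty());
    std::stable_sort(slots.begin(), slots.end(),
                     [](const projector_slot & a, const projector_slot & b) {
                         return a.llm_layer < b.llm_layer;
                     });
    injection_plan plan;
    plan.base_proj = slots.front().proj_id;
    plan.deepstack.assign(slots.begin() + 1, slots.end());
    return plan;
}
```

With N projectors this leaves N-1 deepstack entries, which matches the description of the deepstack portion containing only the entries after the first.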

* fix: Fix mmproj hparams in conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix ordering/logic for deepstack injection in granite

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix preprocessing config to match what the model needs

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* wip: Partial port of Eli's implementation

This is still pretty broken, but it's getting closer. It now happily
generates tokens, but the values are quite incorrect still. I suspect it's
caused by the mapping of projectors from safetensors to their respective
orders here.

Also, this implementation breaks encapsulation pretty badly in mtmd_encode.
This will need a big refactor to put the G4V-specific encoding logic
somewhere more appropriate.

Branch: Granite4Vision
AI-usage: draft (Claude Code, Bob)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Eli Schwartz <eliyahu.schwartz@ibm.com>

* fix: Fix the pre-scaling on the input embeddings to correctly invert the scale

We've got tokens! They still don't line up quite right, so something's a
little off, but we're getting much closer now.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: invert embedding multiplier -> base_scale at load

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix setting image_resize_pad after new enum introduced

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add G4V to mmproj mapping in conversion

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Re-add padding disable for non-hybrid hybrid models

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Simplify G4V n_tokens computation

This is slightly more efficient and flexible for when we implement the
unpad cropping. IMO, it's also clearer that it is adding the number of
image_newline tokens (embeddings) to the grid, rather than recomputing the
entire count.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add new clip APIs for post-tile-encoding assembly

Granite 4 Vision uses llava-next style pack-and-unpad which requires
injecting the learned newline after each row of the tile grid. A row here
is a single row of the grid which is composed of (grid_x * cols_per_tile) *
(grid_y * rows_per_tile), so the result is newlines injected in between
individual tile rows, thus not something that can be handled with the
standard llava-uhd block-wise encoding.

Branch: Granite4Vision
AI-usage: draft (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add model interfaces for granite 4 vision assembler

I'm on the fence about the best organization of this. These free functions
allow the per-architecture logic in clip.cpp to access the model-specific
graph building, but they still require a fair bit of model-specific logic
in clip.cpp which is not ideal.

I think a better approach may be to replicate what is done with the
graph builders themselves (and possibly even make the assembler part of the
model's existing graph builder).

Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove all g4v-specific branching from mtmd.cpp in favor of clip assembler

Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(mtmd): Consolidate assembler logic into clip_assembler class family

Just like `clip_graph` is the base class for building the model-specific
encoder graphs, `clip_assembler` will be the base class for building the
model-specific assembler graphs. This allows the assembly pattern to follow
how the encoder pattern is implemented where the model-specific logic lives
in a subclass co-located with the encoder graph builder that gets
constructed by a simple factory method.

Branch: Granite4Vision
AI-usage: full (Claude Code + Opus 4.7)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Comment improvement

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: granite_vision -> granite4_vision

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack

These pieces were never used on the C++ side (removed there in an earlier
commit), so this is just cleanup that I missed before.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Oops! I did not mean to commit one of my prompt files

But now it's too far back in history to effectively rebase out, even with
interactive and --rebase-merges :(

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add missing <algorithm> include for std::find

It seems that this was already pulled in transitively on some platforms,
but not on others.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
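
For reference, a small sketch of the include pattern this fix relies on. `contains_layer` is a hypothetical helper, not a function from the PR:

```cpp
#include <algorithm> // std::find is declared here; do not rely on transitive includes
#include <vector>

// Hypothetical helper: true if `layer` appears in `layers`.
static bool contains_layer(const std::vector<int> & layers, int layer) {
    return std::find(layers.begin(), layers.end(), layer) != layers.end();
}
```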

* fix: Fix Flake8 warnings in granite conversion module

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove clip_assembler in favor of clip_image_f32.append_token

Per conversation in the PR, the clip_assembler pattern was too invasive.
This is a compromise that limits model-specific blocks to add_media where
each preprocessed tile is annotated with an injection type, after which all
the token counting logic is generic and the newline injection itself is
handled in the graph based on the value for the given tile image.

Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen 3.6 35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(convert): Split n_deepstack_layers and deepstack_layers (array)

Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys

Branch: Granite4Vision
AI-usage: draft (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix GGUF key for deepstack_layers_arr

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove pre-scaling embeddings and skip scaling for raw embd inputs

This follows how gemma3 and gemma4 handle embedding scaling by skipping the
multiplier for raw input embeddings.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
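
A minimal sketch of the behavior described above, written against plain vectors rather than the real graph API. The function name and the `is_token_input` flag are assumptions made for illustration:

```cpp
#include <vector>

// Illustrative stand-in for the input stage of a model graph.
// is_token_input is true when the values came from the token embedding table
// and false when the caller supplied raw embeddings (for example, projected
// image features).
static void scale_input_embeddings(std::vector<float> & embd, float multiplier, bool is_token_input) {
    if (!is_token_input) {
        return; // raw embeddings pass through unscaled
    }
    for (float & v : embd) {
        v *= multiplier;
    }
}
```

With this arrangement the projector output does not need to be pre-divided by the multiplier, which is what the earlier pre-scaling commits were compensating for.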

* refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr)

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Fully revert changes to n_deepstack_layers and qwen3vl*

Since we're going to keep the GGUF KVs separate, it makes sense to just
keep the hparams separate too to limit the scope of this branch. The
downside is that n_deepstack_layers and deepstack_mapping_arr are
potentially conflicting.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Revert removal of "is_deepstack_layers" GGUF KV

This KV is not used at all on the C++ side, so it's fully dead, but there's
also no need to conflate this cleanup with the addition of G4V.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary ggml_cont and build_forward_expand in cbx

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Clean up comments

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Tighter and more flexible code for g4v_build_block

This could be refactored to look a lot more like granite-speech, but the
overall block constructs before/after the qformer are pretty different, so
for now I'm going to leave it as is and just tighten a bit.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary `unordered_set` include

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Add architecture guard on deepstack_mapping_arr printout

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unnecessary AI-gen comment

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Always initialize deepstack_mapping_arr with -1 values

This was causing `test-llama-archs` to fail, likely due to trying to save
the uninitialized values, then re-loading them. It's safer to always
initialize so that other models don't forget and end up with undefined
behavior.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
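
A sketch of the always-initialize pattern, assuming a fixed-size array member. The struct name and the array bound are made up for illustration; only the field name and the -1 fill value come from the commit message:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Made-up bound for the sketch; the real limit is whatever the project uses.
constexpr std::size_t MAX_LAYERS_SKETCH = 8;

struct hparams_sketch {
    std::array<int32_t, MAX_LAYERS_SKETCH> deepstack_mapping_arr;

    // Fill on construction so every model sees defined values, including
    // models that never set the mapping (-1 assumed to mean "no mapping").
    hparams_sketch() { deepstack_mapping_arr.fill(-1); }
};
```

A save-then-reload round trip then writes and reads the same well-defined values regardless of architecture.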

* style: Remove TODO about block vs. non-block tensor mapping

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Move is_vision_feature_layer logic into clip_hparams

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use a bool for append_token

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Remove unnecessary comment

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove unused get_model api

yikes!

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Rearrange helpers for g4v to be private members and use build_attn

Branch: Granite4Vision
AI-usage: full (Bob, OpenCode + Qwen3.6-35b)
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix off-by-one in vision layer index

This was inherited from the Claude Code implementation that pushed the
negative index inversion down into the model file.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
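
The commit message does not say exactly where the off-by-one sat. One common source of this kind of error is converting a Python-style negative index into a non-negative one. The sketch below assumes negative indices address a list of n_layers + 1 hidden states (the embedding output followed by one entry per layer); the function name is hypothetical:

```cpp
#include <cassert>

// Assumption: entry 0 is the embedding output and entry i is the output of
// layer i, so there are n_layers + 1 addressable hidden states.
static int resolve_hidden_state_index(int idx, int n_layers) {
    const int n_states = n_layers + 1;
    const int resolved = idx < 0 ? n_states + idx : idx;
    assert(resolved >= 0 && resolved < n_states);
    return resolved;
}
```

Using `n_layers + idx` instead of `n_states + idx` would select one layer too early, which is the shape of mistake an index inversion like this tends to produce.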

* fix: Fix norm/post_norm mixup in conversion

face. palm. :(

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: More descriptive tensor names

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Apply PR cleanup for new conversion changes

AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix(convert): Remove duplicate V_ENC_EMBD_IMGNL

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: append_token -> add_newline

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* style: Comment cleanup

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Cleaner error handling/checking

NOTE: format_string is not available in granite.cpp (and including
clip-impl.h to get it doesn't compile, so I think it violates the intended
encapsulation), so std::stringstream is the simplest answer.

Branch: Granite4Vision
AI-usage: none
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
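
A minimal sketch of building an error message with std::stringstream when a printf-style formatting helper is not available. The function names and the message text are illustrative, not the ones used in the PR:

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: format a message without a printf-style utility.
static std::string make_layer_error(int layer, int n_layers) {
    std::stringstream ss;
    ss << "invalid layer index " << layer << " (model has " << n_layers << " layers)";
    return ss.str();
}

// Hypothetical check: throws if the index is outside [0, n_layers).
static void check_layer(int layer, int n_layers) {
    if (layer < 0 || layer >= n_layers) {
        throw std::runtime_error(make_layer_error(layer, n_layers));
    }
}
```

This needs only standard headers, so it avoids pulling an internal header into a file that is not meant to depend on it.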

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Labels

examples, model (Model specific), python (python script changes)
